IODS (Introduction to Open Data Science) is a MOOC offered by the University of Helsinki. My goal is to learn new R tips and tricks and to become more organized and productive in data analysis tasks.
The source of this learning diary is available in my GitHub repo.
Last updated: 02.12.2018
The following analysis uses a dataset (n=166) gathered with the ASSIST (Approaches and Study Skills Inventory for Students) survey. For more information regarding the dataset, please see Vehkalahti (2016).
The purpose of the analysis is to explore whether it is possible to predict a student’s test performance using composite variables depicting the student’s mental inclination towards studying, namely:
These four composite variables were created from a set of 42 variables. The original variables were measured on a five-point Likert scale, and the composite variables were scaled back to the five-point range.
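The construction of the composite variables is not shown here; a minimal sketch of the idea uses `rowMeans()`, which keeps the result on the original 1–5 range (the item names below are illustrative, not the actual column names):

```r
# Sketch: a composite score is the row mean of its Likert items,
# which keeps the result on the original five-point scale.
# "D03", "D11", "D19", "D27" are hypothetical item names.
deep_items <- c("D03", "D11", "D19", "D27")
learning2014$deep <- rowMeans(learning2014[, deep_items])
```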
From the exploratory graph we can see that the gender and age distribution is rather skewed in the data. There are almost 50% more female students than male students, especially among the students in their twenties. Even though the mean age is 25.51, there are lots of older students too, the maximum age being 55.
Visual inspection indicates that even though there are slight differences in the attitude and learning approach scores (deep, surface, strategic) between males and females, these differences do not seem to affect median test points. From the figure below we can see that the bulk of the data sits right in the middle of the scale, so there are very few data points at the ends of the scales.
In building linear regression models, it is desirable to have high correlations between the predictor covariates and the response variable, but small correlations between the predictor covariates themselves. Large correlations between predictor variables lead to collinearity in linear regression, e.g. the model might overfit the data.
Based on explorative analysis in Fig 1, three variables were chosen as prediction covariates.
For the first model, the following variables were chosen as predictors.
Attitude and the strategic learning approach do not correlate (0.06); the strategic and surface learning approaches correlate slightly (-0.16), as do the surface learning approach and attitude (-0.17).
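These pairwise correlations can be checked directly with `cor()`; a sketch:

```r
# Pairwise correlations between the candidate predictors;
# small off-diagonal values suggest little collinearity risk
cor(learning2014[, c("attitude", "stra", "surf")])
```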
| Predictor | Estimate | Confidence Interval (95%) | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 11.02 | [3.74, 18.29] | 2.99 | .003 |
| Attitude | 3.40 | [2.26, 4.53] | 5.91 | < .001 |
| Stra | 0.85 | [-0.22, 1.92] | 1.58 | .117 |
| Surf | -0.59 | [-2.17, 1.00] | -0.73 | .466 |
F Test summary: (R2=0.2074, F(3,162)=14.13, p<1e-05). Adjusted R2 is 0.19.
R2 is the percentage of the dependent-variable variation explained by the linear model. High R-squared values indicate smaller differences between the observed data and the fitted values. With R2 = 0.2074, our first model explains about 21% of the observed variation in students’ test scores.
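The definition above can be verified by hand from the residuals; a sketch, where `model` is a hypothetical name for a fitted `lm` object:

```r
# R^2 = 1 - SSE/SST: the share of total variation captured by the fit
# ("model" is a hypothetical name for a fitted lm object)
observed <- fitted(model) + residuals(model)
sse <- sum(residuals(model)^2)
sst <- sum((observed - mean(observed))^2)
1 - sse / sst  # matches summary(model)$r.squared
```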
From the summary we can see that the surface learning variable (surf) does not have a statistically significant relationship with test points (p = .466). Neither is the strategic learning approach variable (stra) significant (p = .117), so we’ll exclude them from the final model.
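The refit of the final model is not shown as code; a sketch (`model_final` is a hypothetical name):

```r
# Final model: attitude as the only predictor of test points
model_final <- lm(points ~ attitude, data = learning2014)
summary(model_final)   # coefficients, t-statistics, p-values
confint(model_final)   # 95% confidence intervals for the estimates
```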
| Predictor | Estimate | Confidence Interval (95%) | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 11.64 | [8.02, 15.25] | 6.36 | < .001 |
| Attitude | 3.53 | [2.41, 4.65] | 6.21 | < .001 |
For the final model F Test summary: (R2=0.1906, F(1,164)=38.61, p<1e-05). Adjusted R2 is 0.19.
We see that R2 stays roughly the same for the revised model. Based on the model coefficient (estimate) and the 95% confidence interval for the estimate, we see that for a one-unit increase in the attitude (predictor) variable, the increase in points is between 2.41 and 4.65, the average increase being 3.53 points.
Regression diagnostic plots help to check whether the model meets the assumptions of linear regression:
The Residuals vs. Fitted plot shows whether the residuals have non-linear patterns. If the residuals have a constant error variance, the plot scatters randomly without any distinct patterns. We don’t see patterns here, so we can conclude that the variance is constant and the assumption of equal residual variance is met.
The Normal Q-Q plot shows whether the residuals are normally distributed. If the residuals deviate from the straight line, it is an indication of non-normality. There are a few outlying values at both ends of the line, but the majority of the points lie nicely on it, so we can conclude that the residuals of the model are normally distributed.
The Residuals vs. Leverage plot helps to find influential cases. Not all outliers have leverage in linear regression analysis. The values to watch are in the upper right or lower right corner; they are the ones that can affect the regression results. We don’t see outlying values in these critical spots, so we can conclude that there are no influential individual cases in the data.
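The three diagnostic plots discussed above come directly from the fitted model object; a sketch (`model_final` is a hypothetical name for the final model):

```r
# 1 = Residuals vs Fitted, 2 = Normal Q-Q, 5 = Residuals vs Leverage
par(mfrow = c(1, 3))
plot(model_final, which = c(1, 2, 5))
```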
Even though we just concluded that there are no influential individual cases in the data, we can see from the plots that there are a few outlying cases in non-critical spots of the Residuals vs. Leverage plot. What happens if we exclude them from the data? To do that, we have to single out the cases using Cook’s distance.
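Cook’s distances can be listed straight from the fitted model; a sketch for finding the most influential cases (`model_final` is a hypothetical name):

```r
# Rank observations by Cook's distance; the names give the row indices
cooksd <- cooks.distance(model_final)
head(sort(cooksd, decreasing = TRUE), 3)
```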
Just for the sake of curiosity, let’s investigate how excluding three outlier cases (35, 56 & 71) affects our regression model.
learning2014$id <- 1:dim(learning2014)[1]

model_attitude_stra_excluded <- lm(
  data = learning2014 %>% filter(!id %in% c(35, 56, 71)),
  formula = points ~ attitude)

| Predictor | Estimate | Confidence Interval (95%) | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 10.24 | [6.81, 13.68] | 5.89 | < .001 |
| Attitude | 4.01 | [2.95, 5.08] | 7.43 | < .001 |
For the model created using data without cases 35, 56, and 71, the F Test summary: (R2=0.2555, F(1,161)=55.26, p<1e-05). Adjusted R2 is 0.25.
The fit of the model seems to be marginally better when the three outlying cases are excluded. Let’s see from the data what kind of cases these are, so that we can decide whether it is prudent to consider them outliers.
| Gender | Age | Attitude | Deep | Strategic | Surface | Points |
|---|---|---|---|---|---|---|
| M | 20 | 4.2 | 4.50 | 3.25 | 1.58 | 10 |
| F | 45 | 3.8 | 3.00 | 3.12 | 3.25 | 9 |
| F | 23 | 1.9 | 4.33 | 2.75 | 2.92 | 29 |
Looking at the data, two unfortunate students have scored very low in the test even though their motivation is strong. It might be that some external cause affected their test performance, as the other low-scoring (points <= 10) students differ somewhat from these cases. There is also one student whose attitude score is very low but who has achieved high points in the test.
| Gender | Age | Attitude | Deep | Strategic | Surface | Points |
|---|---|---|---|---|---|---|
| M | 53 | 3.5 | 3.50 | 3.12 | 2.25 | 10 |
| M | 29 | 1.7 | 4.08 | 3.00 | 3.75 | 9 |
| F | 37 | 2.3 | 3.67 | 2.75 | 2.42 | 9 |
| M | 20 | 4.2 | 4.50 | 3.25 | 1.58 | 10 |
| F | 45 | 3.8 | 3.00 | 3.12 | 3.25 | 9 |
| F | 21 | 3.5 | 3.92 | 3.88 | 3.50 | 7 |
Our goal is to predict the level of alcohol use based on data collected using school reports and questionnaires from two Portuguese secondary-level schools. The original dataset (n=382) has the following variables:
In addition, the following variables were created for the analysis:
The numerical absenteeism score was transformed into a five-class categorical variable (0–1, 2, 3–4, 5–7, >8) for the purpose of the analysis.
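The transformation itself is not shown; a sketch with `cut()`, assuming the raw count lives in a column called `absences` (a hypothetical name):

```r
# Collapse the numeric absence count into the five classes used below
# ("absences" is an assumed column name)
alc$absnt <- cut(alc$absences,
                 breaks = c(-Inf, 1, 2, 4, 7, Inf),
                 labels = c("0-1", "2", "3-4", "5-7", ">8"))
```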
Because there are as many as 27 variables in the final dataset, choosing interesting and hopefully fruitful variables just by guessing is not likely to succeed. Therefore, let’s use the Goodman-Kruskal tau to explore associations between the variables in the dataset.
We have to use tau instead of the familiar Pearson correlation, because Pearson correlation does not give useful results when the variables are not continuous. It should be noted that even though many of the variables in the dataset look numerical (such as study time, frequency of going out and amount of absenteeism), according to the dataset description they are actually ordered categorical variables, so using tau to study association is a sensible choice.
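One way to compute the tau matrix is the GoodmanKruskal package (an assumption; the source does not name the package used):

```r
# Pairwise Goodman-Kruskal tau values for all variable pairs
library(GoodmanKruskal)
tau_matrix <- GKtauDataframe(alc)
plot(tau_matrix)  # heatmap-style association matrix
```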
From the initial Goodman-Kruskal tau matrix it can be seen that the association between heavy drinking and the other variables is strongest for the frequency of going out with friends, the amount of absenteeism, the time dedicated to studying, and gender. It should be noted that the association with gender is as strong as with freetime, but intuitively it feels more reasonable to study the association of gender with drinking habits than that of the amount of free time.
Based on observed associations in the data and personal experience, a reasonable initial hypothesis would be:
Going out often with friends is correlated with heavy drinking. Gender doesn’t make a significant difference, even though male students tend to drink more. Time dedicated to studies is inversely associated with heavy alcohol use, but high absenteeism has no significant association with heavy drinking.
Next we’ll observe the relationship of variables using mosaic plots, which allow us to explore the data in multiple dimensions at the same time.
Looking at the plots, it seems that our initial hypothesis was somewhat inaccurate. Differences between the genders are large, at least on visual inspection. Going out often with friends is indeed associated with a high level of alcohol use, especially when absenteeism is common (top right corner of the first plot). On the other hand, the size of this group is not very large, so this three-way association might not be strong. Time dedicated to studies is inversely correlated with heavy alcohol use and the amount of absenteeism (top right corner).
##
## Call:
## glm(formula = high_use ~ goout + sex + absnt + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0011 -0.7391 -0.4707 0.7279 2.6452
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.34971 0.69745 -3.369 0.000754 ***
## gooutlow 0.39022 0.70228 0.556 0.578449
## gooutmod 0.63488 0.69198 0.917 0.358895
## goouthigh 2.16469 0.69593 3.110 0.001868 **
## gooutvhigh 2.38300 0.71257 3.344 0.000825 ***
## sexM 0.76948 0.27549 2.793 0.005220 **
## absnt2 0.41615 0.41509 1.003 0.316074
## absnt3–4 0.03195 0.38525 0.083 0.933897
## absnt5–7 -0.20080 0.42637 -0.471 0.637679
## absnt>8 1.05424 0.36589 2.881 0.003961 **
## studytime2–5 h -0.44563 0.30751 -1.449 0.147302
## studytime5–10 h -0.91718 0.48515 -1.891 0.058689 .
## studytime>10 h -1.01073 0.62811 -1.609 0.107583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 374.51 on 369 degrees of freedom
## AIC: 400.51
##
## Number of Fisher Scoring iterations: 4
Of our initial variables, the frequency of going out with friends, gender, and a high level of absenteeism have a statistically significant effect on the level of drinking, while time dedicated to studies is only marginally significant.
Looking at the odds ratios of the model, it can be stated that being male increases the odds of heavy alcohol use over 2-fold compared to females, and going out frequently or very frequently increases the odds 8-10-fold compared to students who go out very little or not at all. High absenteeism also increases the odds of heavy alcohol consumption 2.87-fold compared to students who are rarely absent.
These model-based observations align pretty well with the conclusions made from the initial visual exploration of the data.
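The odds ratios and their confidence intervals below can be obtained by exponentiating the log-odds coefficients of the fitted model; a sketch using the first model:

```r
# Odds ratios with 95% confidence intervals for the first model
exp(cbind(OR = coef(alc_glm_first), confint(alc_glm_first)))
```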
| Predictor | Estimate | CI (95%) | t | p-value | Odds ratio |
|---|---|---|---|---|---|
| gooutmod | 0.63 | [-0.61, 2.18] | 0.92 | .359 | 1.89 |
| goouthigh | 2.16 | [0.92, 3.72] | 3.11 | .002 | 8.71 |
| gooutvhigh | 2.38 | [1.10, 3.97] | 3.34 | .001 | 10.84 |
| sexM | 0.77 | [0.23, 1.32] | 2.79 | .005 | 2.16 |
| absnt>8 | 1.05 | [0.34, 1.78] | 2.88 | .004 | 2.87 |
| studytime5–10 h | -0.92 | [-1.92, 0.00] | -1.89 | .059 | 0.40 |
alc_glm_second <- glm(
formula = high_use ~ goout + sex + absnt,
family = "binomial",
data = alc)
summary(alc_glm_second)

##
## Call:
## glm(formula = high_use ~ goout + sex + absnt, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9016 -0.7054 -0.4885 0.6704 2.4222
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.72000 0.67394 -4.036 5.44e-05 ***
## gooutlow 0.20910 0.69706 0.300 0.764194
## gooutmod 0.42097 0.68290 0.616 0.537598
## goouthigh 1.99006 0.68527 2.904 0.003684 **
## gooutvhigh 2.24070 0.70637 3.172 0.001513 **
## sexM 0.94468 0.26159 3.611 0.000305 ***
## absnt2 0.44525 0.41055 1.085 0.278133
## absnt3–4 0.09012 0.37728 0.239 0.811208
## absnt5–7 -0.15888 0.41776 -0.380 0.703710
## absnt>8 1.16366 0.36153 3.219 0.001288 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 379.98 on 372 degrees of freedom
## AIC: 399.98
##
## Number of Fisher Scoring iterations: 4
| predicted low | predicted high | |
|---|---|---|
| low | 245 | 23 |
| high | 58 | 56 |
| predicted low | predicted high | Sum | |
|---|---|---|---|
| low | 64.1 | 6.0 | 70.1 |
| high | 15.2 | 14.7 | 29.9 |
| Sum | 79.3 | 20.7 | 100.0 |
The cross tables indicate that the model tends to overpredict the proportion of students who are actually low-drinkers and underpredict the proportion of heavy-drinking students.
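The cross tables can be built with `table()` and `prop.table()`; a sketch, assuming `predict_high_use` holds the model’s predicted probabilities:

```r
# Counts of actual vs predicted drinking status
ct <- table(actual = alc$high_use,
            predicted = alc$predict_high_use > 0.5)
ct
round(100 * prop.table(ct), 1)  # the same table as percentages
```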
| Predictor | Estimate | CI (95%) | t | p-value | Odds ratio |
|---|---|---|---|---|---|
| goouthigh | 1.99 | [0.76, 3.53] | 2.90 | .004 | 7.32 |
| gooutvhigh | 2.24 | [0.97, 3.81] | 3.17 | .002 | 9.40 |
| sexM | 0.94 | [0.44, 1.47] | 3.61 | < .001 | 2.57 |
| absnt>8 | 1.16 | [0.46, 1.88] | 3.22 | .001 | 3.20 |
For the final model, the odds of heavy alcohol use increase 2.57-fold when the student is male, 3.20-fold when the number of school absences is higher than 8 (compared to zero to one absences), and 7.32-9.40-fold when the student goes out with friends frequently or very frequently, compared to going out very little or not at all.
loss_func <- function(class, prob) {
  # Proportion of observations whose predicted probability falls
  # on the wrong side of the 0.5 threshold
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
training_error <- loss_func(class = alc$hu, prob = alc$predict_high_use)
training_error_simple <- loss_func(class = alc_simple$hu, prob = alc_simple$predict_high_use)

The model has a 0.212 training error. The error for the initial model is 0.204, so it seems that keeping the studytime predictor in the model would have been useful.
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = alc_glm_second, K = 10)
cv_simple <- cv.glm(data = alc_simple, cost = loss_func, glmfit = alc_glm_first, K = 10)

Based on the 10-fold cross-validation, the prediction error for the model is 0.22, so this model performs a bit better than the model introduced in DataCamp, which had an error of 0.26. Surprisingly, the prediction error for the initial model is also 0.22, making it the best model of the three. To conclude, this somewhat unexpected result indicates that it might be useful to keep even marginally significant predictors in the model, as was the case with the studytime predictor.
Our goal for the fourth week is to explore census data collected in 1970 from the Boston metropolitan area. Our first step is to try to predict the crime rate in the Boston area using a plethora of predictor variables. Finally, we’ll see what kind of clusters we can find in the dataset by running the k-means clustering algorithm over the data.
The dataset is discussed more deeply in Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management 5, 81–102 (PDF)
## Skim summary statistics
## n obs: 506
## n variables: 14
##
## ── Variable type:integer ────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## chas 0 506 506 0.069 0.25 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## rad 0 506 506 9.55 8.71 1 4 5 24 24 ▂▇▁▁▁▁▁▅
##
## ── Variable type:numeric ────────────
## variable missing complete n mean sd p0 p25 p50
## age 0 506 506 68.57 28.15 2.9 45.02 77.5
## black 0 506 506 356.67 91.29 0.32 375.38 391.44
## crim 0 506 506 3.61 8.6 0.0063 0.082 0.26
## dis 0 506 506 3.8 2.11 1.13 2.1 3.21
## indus 0 506 506 11.14 6.86 0.46 5.19 9.69
## lstat 0 506 506 12.65 7.14 1.73 6.95 11.36
## medv 0 506 506 22.53 9.2 5 17.02 21.2
## nox 0 506 506 0.55 0.12 0.39 0.45 0.54
## ptratio 0 506 506 18.46 2.16 12.6 17.4 19.05
## rm 0 506 506 6.28 0.7 3.56 5.89 6.21
## tax 0 506 506 408.24 168.54 187 279 330
## zn 0 506 506 11.36 23.32 0 0 0
## p75 p100 hist
## 94.07 100 ▁▂▂▂▂▂▃▇
## 396.23 396.9 ▁▁▁▁▁▁▁▇
## 3.68 88.98 ▇▁▁▁▁▁▁▁
## 5.19 12.13 ▇▅▃▃▂▁▁▁
## 18.1 27.74 ▃▆▅▁▁▇▁▁
## 16.96 37.97 ▆▇▆▅▂▁▁▁
## 25 50 ▂▅▇▆▂▂▁▁
## 0.62 0.87 ▇▆▇▆▃▅▁▁
## 20.2 22 ▁▂▂▂▅▅▇▃
## 6.62 8.78 ▁▁▂▇▇▂▁▁
## 666 711 ▃▇▂▅▁▁▁▆
## 12.5 100 ▇▁▁▁▁▁▁▁
The dataset has 506 observations and 14 variables. My guess is that the observations correspond to census districts, i.e. subdivisions of towns, as it is very unlikely that the Boston metropolitan area has over 500 towns.
| Variable | Description |
|---|---|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1000s. |
Renaming the variables makes it easier to discuss our findings in later steps.
boston_column_names <- c(
'per_capita_crime',
'proportion_big_lots',
'proportion_industry',
'river_near',
'nox_concentration',
'rooms_per_dwelling',
'proportion_old_houses',
'mean_distance_to_centers',
'highway_access',
'tax_value',
'pupil_teacher_ratio',
'proportion_blacks',
'lower_status',
'house_median_value'
)
names(boston) <- boston_column_names

Let’s have a closer look at the variables that have the highest correlation with criminality.
Looking at the data, my intuition is that low-status (likely working-class) housing near highways and industrial areas is more prone to a higher crime rate.
# Create a categorical variable of the crime rate in the
# Boston dataset (from the scaled crime rate). Use the quantiles as the break
# points in the categorical variable. Drop the old crime rate variable from the
# dataset. Divide the dataset to train and test sets, so that 80% of the data
# belongs to the train set. (0-2 points)
boston_scaled %<>%
mutate(
crime_cat = quantcut(
per_capita_crime,
labels = c("very low","low","high","very high")
)
) %>% select(-per_capita_crime)
# Let's use Caret to create training dataset and make sure that
# the overall class distribution is preserved
# eg. the random sampling occurs within each class
rows_to_train <- caret::createDataPartition(boston_scaled$crime_cat, p = .8,
list = FALSE,
times = 1)
boston_training <- boston_scaled[rows_to_train,]
boston_test <- boston_scaled[-rows_to_train,]

The model predicts the correct class with 73% accuracy. Location near highway entrances seems to contribute most to criminality.
| very low | low | high | very high | |
|---|---|---|---|---|
| very low | 13 | 10 | 2 | 0 |
| low | 8 | 15 | 2 | 0 |
| high | 1 | 11 | 11 | 2 |
| very high | 0 | 0 | 1 | 24 |
The plot represents the variance within the clusters. It decreases as k increases, but one can notice a bend (or “elbow”) at ~2. This bend indicates that additional clusters beyond this point add little value. So, let’s fit the k-means clustering with k=2.
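The elbow search and the final fit can be sketched as follows (`boston_num` is a hypothetical name for a fully numeric, scaled copy of the data):

```r
set.seed(123)  # kmeans uses random starting centroids

# Total within-cluster sum of squares for k = 1..10
twcss <- sapply(1:10, function(k) {
  kmeans(boston_num, centers = k)$tot.withinss
})
plot(1:10, twcss, type = "b", xlab = "k", ylab = "Total WCSS")

# Fit the final clustering at the elbow
km <- kmeans(boston_num, centers = 2)
```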
Looking at the plots, our two clusters correspond quite nicely to our intuition of crime-prone neighbourhoods residing near high-pollution industrial areas and highways. The more affluent cluster 1 (colour black) has less crime. Cluster 2 has more pollution, is located near highways, has factories nearby, and has more crime.
| cluster | proportion_industry | per_capita_crime | nox_concentration | proportion_old_houses | highway_access | tax_value | lower_status |
|---|---|---|---|---|---|---|---|
| 1 | -0.6146857 | -0.3894158 | -0.5823397 | -0.4331555 | -0.5828749 | -0.6291043 | -0.4530491 |
| 2 | 1.1425514 | 0.7238295 | 1.0824279 | 0.8051309 | 1.0834228 | 1.1693521 | 0.8421083 |
This week we’ll try out dimensionality reduction techniques, namely principal component analysis (PCA) and multiple correspondence analysis (MCA), which can be used to explore multidimensional data as points in a low-dimensional space. PCA is applicable to continuous data, whereas MCA can be used when the variables are categorical.
The dataset we’ll use is from the United Nations Development Programme. For the following analysis, the dataset was preprocessed using an external script.
The dataset has the following variables:
| Variable | Description |
|---|---|
| country | name of the country (as row name) |
| edu_exp | expected years of education |
| sec_edu_female | proportion of females with at least secondary education |
| labour_rate_female | proportion of women participating in working life |
| life_exp | life expectancy at birth |
| gni | gross national income |
| mat_mortality | maternal mortality ratio |
| birth_rate | the number of births to women ages 15–19 (per 1000 women) |
| repr_parlament | percentage of female members of the parliament |
Most of the variable distributions are more or less skewed, so it is very likely that PCA will not work without normalizing the data. Only the edu_exp (expected years of education) variable looks normally distributed.
| Statistic | Min | Pctl(25) | Median | Pctl(75) | Max | Median | St. Dev. |
|---|---|---|---|---|---|---|---|
| edu_exp | 5.40 | 11.25 | 13.50 | 15.20 | 20.20 | 13.50 | 2.84 |
| sec_edu_female | 0.90 | 27.15 | 56.60 | 85.15 | 100.00 | 56.60 | 30.93 |
| labour_rate_female | 13.50 | 44.45 | 53.50 | 61.90 | 88.10 | 53.50 | 16.04 |
| life_exp | 49.00 | 66.30 | 74.20 | 77.25 | 83.50 | 74.20 | 8.33 |
| gni | 581 | 4197.5 | 12040 | 24512 | 123124 | 12040 | 18543.85 |
| mat_mortality | 1 | 11.5 | 49 | 190 | 1100 | 49 | 211.79 |
| birth_rate | 0.60 | 12.65 | 33.60 | 71.95 | 204.80 | 33.60 | 41.11 |
| repr_parlament | 0.00 | 12.40 | 19.30 | 27.95 | 57.50 | 19.30 | 11.49 |
Unsurprisingly, maternal mortality and life expectancy are highly correlated (-0.86), as are high birth rate and maternal mortality (0.76). The education variables also correlate strongly with maternal mortality, so that higher values in the education variables go with a lower maternal mortality rate. Secondary education of women is associated with a lower birth rate (-0.68) as well as a higher life expectancy (-0.72). Quite surprisingly, women’s representation in the parliament doesn’t seem to be associated with any of the other variables, or the association is very low. As one might expect, maternal mortality and high birth rate are associated with low GNI.
pca_summary_rows <- c(
"Standard deviation",
"Eigenvalue",
"Proportion of Variance",
"Cumulative Proportion")
pca_human <- prcomp(human)
pca_summary <- pca(pca_human)
rownames(pca_summary) <- pca_summary_rows

|   | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Standard deviation | 18544.2 | 186.3 | 26.0 | 20.1 | 14.3 | 10.6 | 3.7 | 1.4 |
| Eigenvalue | 343886317.3 | 34701.6 | 674.6 | 403.0 | 205.1 | 113.1 | 13.8 | 2.0 |
| Proportion of Variance | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Cumulative Proportion | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
As expected, PCA on non-standardized data doesn’t make much sense, as the variable with the largest variance explains virtually all of the variance, and therefore creating a biplot will fail miserably.
# Use built-in scaling in prcomp() -function
pca_human_final <- prcomp(human, scale = TRUE, center = TRUE)
pca_summary_final <- pca(pca_human_final)
rownames(pca_summary_final) <- pca_summary_rows

|   | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Standard deviation | 2.14 | 1.12 | 0.88 | 0.72 | 0.55 | 0.53 | 0.45 | 0.34 |
| Eigenvalue | 4.56 | 1.25 | 0.77 | 0.52 | 0.31 | 0.28 | 0.20 | 0.11 |
| Proportion of Variance | 0.57 | 0.16 | 0.10 | 0.06 | 0.04 | 0.04 | 0.03 | 0.01 |
| Cumulative Proportion | 0.57 | 0.73 | 0.82 | 0.89 | 0.93 | 0.96 | 0.99 | 1.00 |
After scaling, the PCA results make sense. Because the variance and distribution vary wildly in the original data, running PCA on non-standardized data yields strange results, as PCA as an analysis technique is sensitive to the relative scaling of the variables.
The first principal component explains 57.03% of the total variance of the dataset, and the second principal component explains 15.6% of it. It seems that the first principal component represents the classical welfare state, e.g. healthcare and access to education. The second principal component represents gender equality, e.g. whether women have equal rights to participate in politics and working life. Looking at the biplot, it is evident that having an all-embracing welfare state doesn’t necessarily make a country gender-equal and vice versa, with Qatar (and other oil-rich states) and Rwanda exemplifying this phenomenon.
Our initial exploration of the data using a correlation matrix revealed the complex relationships between the variables only partially, but the PCA biplot makes it possible to understand the whole dataset in an intuitive way almost at a glance.
The questions used to gather the tea dataset can be found in the book Exploratory Multivariate Analysis by Example Using R, Second Edition, pages 131–132.
Location in which the tea is drunk (yes/no):
Time of the day (yes/no):
Questions about the image of the product (yes/no):
20. Do you consider tea to be exotic? 21. Do you associate tea with spirituality? 22. Is tea good for your health? 23. Is tea a diuretic? 24. Do you associate tea with friendliness? 25. Does tea stop the body from absorbing iron? 26. Is tea feminine? 28. Is tea refined? 29. Will tea help you to lose weight? 30. Is tea a stimulant? 31. Is tea a relaxant? 32. Does tea have any effect on your overall health?
Additionally, there were four background questions regarding:
Skim summary statistics
n obs: 300
n variables: 36
| variable | missing | complete | n | n_unique | top_counts | ordered |
|---|---|---|---|---|---|---|
| age_Q | 0 | 300 | 300 | 5 | 15-: 92, 25-: 69, 45-: 61, 35-: 40 | FALSE |
| always | 0 | 300 | 300 | 2 | Not: 197, alw: 103, NA: 0 | FALSE |
| breakfast | 0 | 300 | 300 | 2 | Not: 156, bre: 144, NA: 0 | FALSE |
| dinner | 0 | 300 | 300 | 2 | Not: 279, din: 21, NA: 0 | FALSE |
| diuretic | 0 | 300 | 300 | 2 | diu: 174, Not: 126, NA: 0 | FALSE |
| effect.on.health | 0 | 300 | 300 | 2 | No.: 234, eff: 66, NA: 0 | FALSE |
| escape.exoticism | 0 | 300 | 300 | 2 | Not: 158, esc: 142, NA: 0 | FALSE |
| evening | 0 | 300 | 300 | 2 | Not: 197, eve: 103, NA: 0 | FALSE |
| exciting | 0 | 300 | 300 | 2 | No.: 184, exc: 116, NA: 0 | FALSE |
| feminine | 0 | 300 | 300 | 2 | Not: 171, fem: 129, NA: 0 | FALSE |
| frequency | 0 | 300 | 300 | 4 | +2/: 127, 1/d: 95, 1 t: 44, 3 t: 34 | FALSE |
| friendliness | 0 | 300 | 300 | 2 | fri: 242, Not: 58, NA: 0 | FALSE |
| friends | 0 | 300 | 300 | 2 | fri: 196, Not: 104, NA: 0 | FALSE |
| healthy | 0 | 300 | 300 | 2 | hea: 210, Not: 90, NA: 0 | FALSE |
| home | 0 | 300 | 300 | 2 | hom: 291, Not: 9, NA: 0 | FALSE |
| how | 0 | 300 | 300 | 3 | tea: 170, tea: 94, unp: 36, NA: 0 | FALSE |
| How | 0 | 300 | 300 | 4 | alo: 195, mil: 63, lem: 33, oth: 9 | FALSE |
| iron.absorption | 0 | 300 | 300 | 2 | Not: 269, iro: 31, NA: 0 | FALSE |
| lunch | 0 | 300 | 300 | 2 | Not: 256, lun: 44, NA: 0 | FALSE |
| price | 0 | 300 | 300 | 6 | p_v: 112, p_b: 95, p_u: 53, p_p: 21 | FALSE |
| pub | 0 | 300 | 300 | 2 | Not: 237, pub: 63, NA: 0 | FALSE |
| relaxing | 0 | 300 | 300 | 2 | rel: 187, No.: 113, NA: 0 | FALSE |
| resto | 0 | 300 | 300 | 2 | Not: 221, res: 79, NA: 0 | FALSE |
| sex | 0 | 300 | 300 | 2 | F: 178, M: 122, NA: 0 | FALSE |
| slimming | 0 | 300 | 300 | 2 | No.: 255, sli: 45, NA: 0 | FALSE |
| sophisticated | 0 | 300 | 300 | 2 | sop: 215, Not: 85, NA: 0 | FALSE |
| SPC | 0 | 300 | 300 | 7 | stu: 70, non: 64, emp: 59, mid: 40 | FALSE |
| spirituality | 0 | 300 | 300 | 2 | Not: 206, spi: 94, NA: 0 | FALSE |
| Sport | 0 | 300 | 300 | 2 | spo: 179, Not: 121, NA: 0 | FALSE |
| sugar | 0 | 300 | 300 | 2 | No.: 155, sug: 145, NA: 0 | FALSE |
| Tea | 0 | 300 | 300 | 3 | Ear: 193, bla: 74, gre: 33, NA: 0 | FALSE |
| tea.time | 0 | 300 | 300 | 2 | tea: 169, Not: 131, NA: 0 | FALSE |
| tearoom | 0 | 300 | 300 | 2 | Not: 242, tea: 58, NA: 0 | FALSE |
| where | 0 | 300 | 300 | 3 | cha: 192, cha: 78, tea: 30, NA: 0 | FALSE |
| work | 0 | 300 | 300 | 2 | Not: 213, wor: 87, NA: 0 | FALSE |
| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 300 | 300 | 37.05 | 16.87 | 15 | 23 | 32 | 48 | 90 | ▇▆▃▅▂▂▁▁ |
Let’s use the Goodman-Kruskal tau to explore associations between the variables in the dataset. The Goodman-Kruskal tau can be used to study association when the variables are categorical.
Going through the Goodman-Kruskal tau matrix, we can observe a few interesting associations between variables. Overall, the associations are not very strong, but the following ones emerge when considering associations stronger than the mean tau value (0.08).
Even though we’ll eventually use all variables in the MCA, let’s observe a few of these associations a bit more closely, just for the sake of curiosity.
Let’s use MCA to investigate if there are observable differences between groups of people who prefer different types of tea.
tea_mca <- MCA(tea,
               quanti.sup = 19,       # age as a quantitative supplement
               quali.sup = c(20:36),  # other background variables as categorical supplements
               graph = FALSE)

ggord(tea_mca,
      tea$Tea,
      arrow = 0.2,
      txt = 3,
      vec_ext = 1.05,
      size = 2,
      repel = TRUE) + theme_minimal()

Looking at the biplot, drinkers of different types of tea seem to differ somewhat. People who drink green and black tea more often buy their tea from specialist shops, prefer unpackaged tea and don’t mind spending more money. Drinking Earl Grey seems to serve a similar purpose as habitual coffee drinking in Finland: people drink it at work or with friends during lunch.
It should be kept in mind that the proportion of green tea drinkers is very small compared to the Earl Grey drinkers, so the conclusions based on the biplot are likely not very strong.
The following interactive plot makes it possible to investigate which responses contribute most strongly to the model. The darker the colour, the stronger the contribution.
A scree plot can be used to visualize the percentage of inertia (i.e., in this context, variance) explained by the MCA dimensions. From the plot we can see that the first dimension explains approximately 10% of the variance and the second one 8%.
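One way to draw the scree plot is with the factoextra package (an assumption; the source does not name the plotting tool used):

```r
# Percentage of inertia explained by each MCA dimension
library(factoextra)
fviz_screeplot(tea_mca, addlabels = TRUE)
```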
Vehkalahti, Kimmo. 2016. “The Relationship Between Learning Approaches and Students’ Achievements in an Introductory Statistics Course in Finland.” In Proceedings of the 60th ISI World Statistics Congress of the International Statistical Institute, ISI2015, 68–73. http://hdl.handle.net/10138/163015.